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Abstract 



In this paper, we study convex optimization methods for computing the trace norm regular- 
ized least squares estimate in multivariate linear regression. The so-called factor estimation and 
selection (FES) method, recently proposed by Yuan et al. [52], conducts parameter estimation and 
factor selection simultaneously and have been sho-wn to enjoy nice properties in both large and 
finite samples. To compute the estimates, ho-wever, can be very challenging in practice because 
of the high dimensionality and the trace norm constraint. In this paper, -we explore a variant 
of Nesterov's smooth method [20| and interior point methods for computing the penalized least 
squares estimate. The performance of these methods is then compared using a set of randomly 
generated instances. We sho-w that the variant of Nesterov's smooth method [lOj generally out- 
performs the interior point method implemented in SDPT3 version 4.0 (beta) [Hj substantially . 
Moreover, the former method is much more memory efficient. 

Key "words: Cone programming, smooth saddle point problem, first-order method, interior point 
method, multivariate linear regression, trace norm, dimension reduction. 

AMS 2000 subject classification: 90C22, 90C25, 90C47, 65K05, 62H12, 62J05 

1 Introduction 

Multivariate linear regression is routinely used in statistics to model the predictive relationships of 
multiple related responses on a common set of predictors. In general multivariate linear regression, 
-we have I observations on q responses b = (6i, . . . , bg)' and p explanatory variables a = (ai, . . . , Op)', 
and 
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B = AU + E 



(1) 



where B = {h^, . . . ,b')' £ ^^^'^ and A = (a^, . . . ,a')' G K'^^ consists of the data of responses and 
explanatory variables, respectively, U G SR^^"^ is the coefficient matrix, E = (e^,...,e')' S 'Sl^^'^ is 
the regression noise, and all e*s are independently sampled from M{0, S). 

Classical estimators for the coefficient matrix U such as the least squares estimate are known to 
perform sub-optimally because they do not utilize the information that the responses are related. 
This problem is exacerbated when the dimensionality p or g is moderate or large. Linear factor models 
are widely used to overcome this problem. In the linear factor model, the response B is regressed 
against a small number of linearly transformed explanatory variables, which are often referred to as 
factors. More specifically, the linear factor model can be expressed as 

B = Fn + E, (2) 

where E W^'^, and F = AT for some F G K^^*^ and r < m.m{p,q}. The columns of F, namely, 
Fj (j = 1, . . . ,r) represent the so-called factors. Clearly ([2]) is an alternative representation of ([1]) 
with U = TQ, and the dimension of the estimation problem reduces as r decreases. Many popular 
methods including canonical correction (Hotelling [9l [TO] ) , reduced rank (Anderson [1] , Izenman [11] , 
Reinsel and Velu [E]), principal components (Massy [H]), partial least squares (Wold [21]) and joint 
continuum regression (Brooks and Stone 0) among others can all be formulated in the form of linear 
factor regression. They differ in the way in which the factors are determined. 

Given the number of factors r, estimation in the linear factor model most often proceeds in two 
steps: the factors, or equivalently T, are first constructed, and then 0, is estimated by least squares 
for ([2]). It is obviously of great importance to be able to determine r for ([2]). For a smaller number 
of factors, a more accurate estimate is expected since there are fewer free parameters. But too few 
factors may not be sufficient to describe the predictive relationships. In all of the aforementioned 
methods, the number of factors r is chosen in a separate step from the estimation of ([2]) through 
either hypothesis testing or cross-validation. The coefficient matrix is typically estimated on the 
basis of the number of factors selected. Due to its discrete nature, this type of procedure can be very 
unstable in the sense of Breiman [5]: small changes in the data can result in very different estimates. 

Recently, Yuan et al. |22j proposed a novel method that can simultaneously choose the number 
of factors, determine the factors and estimate the factor loading matrix 17. It has been demonstrated 
that the so-called factor estimation and selection (FES) method combines and retains the advantages 
of the existing methods. FES is a constrained least square estimate where the trace norm or the 
nuclear norm (or the Ky Fan m-norm where m := min{p, q}) of the coefficient matrix U is forced to 
be smaller than an upper bound: 

min Tt{{B - AU)W{B - AU)') 

^ ™ (3) 
s.t. Y: ai(U) <M. 

i=l 

where is a positive definite weight matrix. Common choices of the weight matrix W include 
and /. To fix ideas, we assume throughout the paper that W = I. Under this assumption, ([3]) is 
equivalent to 

min \\B - AUWi 
u 

m (4) 

s.t. J2 a,{U) <M. 
1=1 



2 



It is shown in Yuan et al. [22j that the constraint used by FES encourages sparsity in the factor space 
and at the same time gives shrinkage coefficient estimates and thus conducts dimension reduction 
and estimation simultaneously in the multivariate linear model. Recently, Bach [2j further provided 
necessary and sufficient conditions for rank consistency of trace norm minimization with the square 
loss by considering the Lagrangian relaxation of dH). He also proposed a Newton-type method for 
finding an approximate solution to the latter problem. It shall be mentioned that his method is only 
suitable for the problems where p and q are not too large. 

In addition, the trace norm relaxation has been used in literature for rank minimization problem. 
In particular, Fazel et al. [7] considered minimizing the rank of a matrix U subject to U £ C, where 
C is a closed convex set. They proposed a convex relaxation to this problem by replacing the rank of 
U by the trace norm of U. Recently, Recht et al. jl7l showed that under some suitable conditions, 
such a convex relaxation is tight when C is an affine manifold. The authors of [T7] also discussed 
some ffist- and second-order optimization methods for solving the trace norm relaxation problem. 

The goal of this paper is to explore convex optimization methods, namely, a variant of Nesterov's 
smooth method [20], and interior point methods for solving (jj]). We also compare the performance 
of these methods on a set of randomly generated instances. We show that the variant of Nesterov's 
smooth method [20] generally outperforms the interior point method implemented in the code SDPT3 
version 4.0 (beta) \X9\ substantially, and that the former method requires much less memory than 
the latter one. 

The rest of this paper is organized as follows. In Subsection ll.il we introduce the notation that 
is used throughout the paper. In Section [21 we present some technical results that are used in our 
presentation. In Section[3l we provide a simplification for problem and present cone programming 
and smooth saddle point reformulations for it. In Section [J we review a variant of Nesterov's 
smooth method [20j and discuss the details of its implementation for solving the aforementioned 
smooth saddle point reformulations of dl]). In Section [5l we present computational results comparing 
a well-known second-order interior-point method applied to the aforementioned cone programming 
reformulations of ([!]) with the variant of Nesterov's smooth method for solving smooth saddle point 
reformulations of (jl]). Finally, we present some concluding remarks in Section [6] and state some 
additional technical results in the Appendix. 

1.1 Notation 

The following notation is used throughout our paper. For any real number a, [a]"*" denotes the 
nonnegative part of a, that is, [a]"*" = max{a, 0}. The symbol W denotes the p-dimensional Euclidean 
space. We denote by e the vector of all ones whose dimension should be clear from the context. For 
any w G W, T)iag{w) denotes the p x p diagonal matrix whose ith diagonal element is Wi for 
i = 1, . . . ,p. The Euclidean norm in ^R.^ is denoted by || • ||. 

We let 5" denote the space of n x n symmetric matrices, and Z ^ indicate that Z is positive 
semidefinite. We also write 5" for {Z G 5" : Z ^ 0}, and for its interior, the set of positive 
definite matrices in 5". For any Z G S"", we let Aj(Z), for i = l,...,n, denote the ith. largest 
eigenvalue of Z, Amin(-^) (resp., Amax(-^)) denote the minimal (resp., maximal) eigenvalue of Z, and 
define ||Z||oo := ^SiXi<i<n and = "^l^i |Aj(Z)|. Either the identity matrix or operator 

will be denoted by /. 

The space of all p x g matrices with real entries is denoted by JR^^'?. Given matrices X and 
Y in SR^^"^, the standard inner product is defined hy X • Y = Tr{X'^Y), where Tr(-) denotes the 
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trace of a matrix. The operator norm and the Frobenius norm of a p x (/-matrix X are defined as 
||X|| := max{||Xn|| : ||u|| < 1} = [Xi-aax{X'^ X)]^/'^ and ||^||f := VX • X , respectively. Given any 
X G W^'^, we let vec(X) denote the vector in JR^'^ obtained by stacking the columns of X according 
to the order in which they appear in X, and cri(X) denote the ith. largest singular value of X for 
z = 1, . . . , min{p, q}. (Recall that ai{X) = [Ai(X^X)] V2 = [Ai(XX^)] V2 for i = 1, . . . , min{p, q}.) 
Also, let g : Wi sji{p+i)^(j>+i) be defined as 

G{X) := ^ ) , VX G W'<1. (5) 

The following sets are used throughout the paper: 



= {Xe Wi : \\X\\f < r}, 
= {Z eS"" ■.\\Z\\i=r, zyQ}, 
= {Ze5" : ||Z||i < r, 0}, 



where the latter is the well-known p-dimensional second-order cone. 

Let U he a normed vector space whose norm is denoted by || • \\u- The dual space of U, denoted 
by U*, is the normed vector space consisting of all linear functionals of u* : U ^ ^, endowed with 
the dual norm || • ||^ defined as 

= max{(n*,«) : llwHw < 1}, ^u* eU*, 

u 

where {u*,u) := u*{u) is the value of the linear functional u* at u. 

If V denotes another normed vector space with norm || • ||v, and £ :U ^ V* is a linear operator, 
the operator norm of £ is defined as 

\\£\\u,v = max{||£:M||y : \\u\\u < !}• (6) 

A function / : C [/ — 3fi is said to be L-Lipschitz-differentiable with respect to || • if it is 
differentiable and 

\\Vf{u)-Vf{um<L\\u-u\\u, yu,uen. (7) 



2 Some results on eigenvalues and singular values 

In this subsection, we establish some technical results about eigenvalues and singular values which 
will be used in our presentation. 

The first result gives some well-known identities involving the maximum eigenvalue of a real 
symmetric matrix. 

Lemma 2.1. For any Z E and scalars a > and P E^, the following statements hold: 

A„iax(^) = max Z»W, (8) 

weA'^{l) 

[aX^e^iZ) + P]+ = max aZ*W + pTr{W). (9) 
VFeA5(i) 
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Proof. Identity ([8]) is well-known. We have 

[aX^^^iZ) + P]+ = [A^ax(a^ + = max tX^^^iaZ + pi) 

te[o,i] 

max t(aZ + pI)»W = max (aZ + /?/) • W, 
tG[o,i],VKeA^(i) h/gA5{i) 

where the third equality is due to ([8|) and the fourth equality is due to the fact that tW takes all 
possible values in A" (1) under the condition that t G [0, 1] and W E A!i(l). ■ 

The second result gives some characterizations of the sum of the k largest eigenvalues of a real 
symmetric matrix. 

Lemma 2.2. Let Z £ 5" and integer 1 < k < n be given. Then, the following statements hold: 

a) For t £ ?R., we have 

k ( t-ks- Tr(y) > 0, 

^Xi{Z)<t ^ < Y - Z + sl to, 
i=i ( Y t 0, 

for some Y G S" and s £ 

b) The following identities hold: 

k 

VAi(Z) = min max k(Z -Y) + Ti(Y) (10) 

1 = 1 

= max{Z»W :TT(W) = k,0~<W I}. (11) 

c) For every scalar a > and (3 £ ?R., the following identities hold: 

k 



aJ^Ai(Z) + /3 



2=1 



min max k(aZ -Y)»W +18 + Tt(Y)] Tr(W) (12) 
max {aZ + f3t: Tt(W) = tk, ~< W ~< tl , < t < 1T13) 



Proof, a) This statement is proved on pages 147-148 of Ben-Tal and Nemirovski [3]. 
b) Statement (a) clearly implies that 

k 

^Xi{Z)= mm^Jks + Ti{Y) : Y + sl t Z,Y to}. (14) 



i=l 



Noting that the condition Y + sl t Z is equivalent to s > XjaaxiZ — Y), we can eliminate the variable 
s from the above min problem to conclude that 

k 

Y,UZ)=mm{kX^^^iZ-Y) + TT{Y):Y £Sl}. (15) 

i=l 
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This relation together with clearly implies identity (jlOp . Moreover, noting that the max problem 
(jlip is the dual of min problem (jl4p and that they both have strictly feasible solutions, we conclude 
that identity (|lip holds in view of a well-known strong duality result. 

c) Using (fT5|) . the fact that infa;ex[a^]^ = [inf X]"*" for any X C 3ft and ([9]), we obtain 

k 1 + 



j=i ^ ^ 



min /cAmax 



mm 

ye5" 



kXr 



aZ+^I-Y] +TV(y) 
aZ + ^ / - y ) + T¥(y) 



/3 



= min max aZ + - / - Y • + Tr(y)Tr(T4^), 

from which (|12p immediately follows. Moreover, using (jlip . the fact that [7]"*^ = max^g[o,i] ^7 for 
every 7 G 3ft and performing the change of variable Y = tY in the last equality below, we obtain 



4 = 1 



i=l ^ ' 



max 

ye5" 



max 



max 
yG5",tG5R 



aZ + : Tr(y) = A;, ^ Y ^ / 

t ^aZ + ^ • y : Tr(y) = k, :< Y :< 1 , < t < 1 
aZ+^n»Y: Tr(y) = iA:, ^ F ^ t/, < t < 1 



i.e., ^ holds. 



Lemma 2.3. iei X G 3ft^^'^ 6e given. Then, the following statements hold: 

a) the p + q eigenvalues of the symmetric matrix GiX) defined in arranged in nonascending 
order, are 

ai{X),--- ,am{X),0,--- ,0,-am{X),--- ,-ai{X), 
where m := min(p, g); 

b) For any positive integer k < m, we have 

k k 

i=l i=l 

Proof. Statement (a) is proved on page 153 of [3] and statement (b) is an immediate consequence 
of (a). ■ 

The following result about the sum of the k largest singular values of a matrix follows immediately 
from Lemmas 12.21 and 12.31 
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Proposition 2.4. Let X € 3^^^'' and integer 1 < k < m.m{p, q} be given and set n := p + q. Then: 
a) For t £ ^, we have 



^cri(X) <t ^ < 



i=l 



for some 1" G 5" and s € 5R; 
b) The following identities hold: 



' t-ks-TxiJ) > 0, 

Y-g{x) + si y 0, 
Y y 0, 



i=l 



mill max k(g(X) - Y) • W + TiiY) 

VgSJ14^GA^{1) 

max {G(X) • W : Tr(H^) = fc, ^ ^ /}. 



(16) 
(17) 



c) For every scalar a > and (3 G ?fi, the following identities hold: 

k 



a^a,(X)+/3 



i=l 



= min max k(ag(X) - Y) + \d + Tt(Y)] Tt(W) (18) 
ye5!f VFeA5(i) 

max {ag(X) •W + (3t: Tt(W) = tk, W tl, < t < 1} . (19) 



3 Problem reformulations 

This section consists of ttiree subsections. The first subsection shows that the restricted least squares 
problem ^ can be reduced to one which does not depend on the (usually large) number of rows of 
the matrices A and/or B. In the second and third subsections, we provide cone programming and 
smooth saddle point reformulations for ([4]), respectively. 



3.1 Problem simplification 

Observe that the number of rows of the data matrices A and B which appear in (jl]) is equal to the 
number of observations /, which is usually quite large in many applications. However, the size of the 
decision variable C/ in ([3]) does not depend on /. In this subsection we show how problem ^ can be 
reduced to similar types of problems in which the new matrix A is apx p diagonal matrix and hence 
to problems which do not depend on /. Clearly, from a computational point of view, the resulting 
formulations need less storage space and can be more efficiently solved. 

Since in most applications, the matrix A has full column rank, we assume that this property 
holds throughout the paper. Thus, there exists an orthonormal matrix Q G ^p^p and a positive 
diagonal matrix A G W^p such that A'^A = QA'^Q^. Letting 

Q'^U, H := A-^Q^A^B, (20) 



X 



we have 



\B 



AUfp 



\B\ 



\\AU\\l-2iAU) 
Ti{U^QK^Q'^U) 
\\KXfp - 2[KX) 



,B = Tt{U^A^AU) 
- 2 Tr{U^QAH) 
H = \\AX-Hfp- 



2Tr(?7^A^S) 
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Noting that the singular values of X = Q'^U and U are identical, we immediately see from the above 
identity that (jl]) is equivalent to 

min \\\KX-H\\l 

ra (21) 
s.t. ^ai(X)<M, 

where A and H are defined in (|2U|) . 

In view of Theorem 16.21 we observe that for any A > and e > 0, any e-optimal solution of 
the following Lagrangian relaxation problem 

^ m 

min -||AX-i/||| + A^ai(X). (22) 

i=l 

is an e-optimal solution of problem ([2T|) with M = (Ti(Xe). In practice, we often need to solve 
problem (j21|) for a sequence of M values. Hence, one way to solve such problems is to solve problem 
()22p for a sequence of A values. 

We will later present convex optimization methods for approximately solving the formulations 
([2T]) and ([22]) . and hence, as a by-product, formulation 

Before ending this subsection, we provide bounds on the optimal solutions of problems ([2TI1 and 
(1221). 



Lemma 3.1. For every M > 0, problem ^21\) has a unique optimal solution X"^. Moreover, 

2\\AH\\f 



\\X*m\\f < := min| (A) ' ^^^^ 

Proof. Using the fact that A is apxp positive diagonal matrix, it is easy to see that the objective 
function of ()2ip is a (quadratic) strongly convex function, from which we conclude that ()21|) has a 
unique optimal solution X"^. Since ||//|||,/2 is the value of the objective function of (pT|) at X = 0, 
we have HAX^I^^ - H\\j;,/2 < \\H\\p/2, or equivalently ||AX|^|||, < 2{AH) • Xl.j. Hence, we have 

(A^in(A))2 \\XMl < \\AXl,\\l < 2{AH) . XI, < 2\\XMf \\AH\\f, 

which implies that \\XIj\\f < 2\\AH\\F/X'^i^{A). Moreover, using the fact that \\X\\j, = XlI^Li crf{X) 
for any X G ?RP^^, we easily see that 

m 

\\X\\F<Y,<'ii^)- (24) 

i=l 



Since X'^ is feasible for (I2ip . it then follows from (I24p that HXjl^jHj? < M. We have thus shown that 
inequality ([23]) holds. ■ 

Lemma 3.2. For every A > 0, problem i22\) has a unique optimal solution X^. Moreover, 

WXIWf < Y^cT.iXl) < r, := min Uz, ^.^(A-^i^) . (25) 

i=l I i=l ) 
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Proof. As shown in Lemma |3. 11 the function X G W^'^ \\^^ ~ -f^ll^ is a (quadratic) strongly 
convex function. Since the term A^^j^(7i(X) is convex in X, it follows that the objective function 
of (|22p is strongly convex, from which we conclude that (j22|) has a unique optimal solution X^ . Since 
||i7|||,/2 is the value of the objective function of (p2|) at X = 0, we have 

m ^ m ^ 

XY.^^{Xl) < -\\AXl-H\\l + xY,<^^{Xl) < -\\H\\l. (26) 

1=1 i=l 

Also, considering the objective function of (j22p at X = A~^H, we conclude that 

m ^ mm 

A^^7,(X;;) < -\\AXl-H\\l + xY,ai{Xl) < xJ2^,{A~'H). (27) 

i=l i=l 1=1 

Now, ([25]) follows immediately from ([26]) and ([27]). ■ 
3.2 Cone programming reformulations 

In this subsection, we provide cone programming reformulations for problems (|21|) and (j22|) . respec- 
tively. 



Proposition 3.3. Problem ^23\} can be reformulated as the following cone programming: 

min 2r + Xt 

r,s,t,X,Y 

( r + 1 \ 
s.t. r-l e£™+2, 

V vec(AX -H) ) (28) 

Y -g{X) + sIh^, 

ms + Tr(y) - t < 0, F ^ 0, 

w/iere (r, s, X, y) € 3? x 3? x 3f? x sRp^^" x 5" mi/i n:=p + q and g{X) is defined in (0). 
Proof. We first observe that (j22p is equivalent to 

min 2r + Xt 

r,X 

s.t. ||AX - H\\l < 4r (29) 

j:<^i{x)-t<o. 

1=1 

Using Lemma 12.31 and the following relation 

/ r + 1 

4r > ||f Ip 44- 

for any v ^ and r G 3f?, we easily see that (f29]l is equivalent to ([28]l 
The following proposition can be similarly established. 




9 



Proposition 3.4. Problem h21\) can he reformulated as the following cone programming: 



mm 

r,s,X,Y 



2r 



s.t. 



r+l 
r — 1 
vec{AX - H) 




(30) 



Y -g{x) + sihQ, 

ms + Tr(y) < M, Y h 0, 
where (r, s,X,Y) e ^ x ^ x W^'^ x 5" with n := p + q and G{X) is defined in l^j. 

3.3 Smooth saddle point reformulations 

In this section, we provide smooth saddle point reformulations for problems (i2T]) and ((22]) . 
3.3.1 Smooth saddle point reformulations for (122[> 

In this subsection, we reformulate (|22p into a smooth saddle point problem that can be suitably 
solved by a variant of Nesterov's smooth method as described in Subsections 14.11 and l4.2i 

We start by introducing the following notation. For every t > 0, we let denote the set defined 

as 



Theorem 3.5. For some e > 0, assume that X^ is an e-optimal solution of the smooth saddle point 
problem 



where Q{X) and r^ are defined in ^ and i25\} . respectively. Then, X^ is an e-optimal solution of 
problem (2^) . 

Proof. This result follows immediately from Lemma 13.21 and relations (|17|) with k = m, (j22|) and 
dni) with t = l. m 

In addition to the saddle point (min-max) reformulation ([32]), it is also possible to develop an 
alternative saddle point reformulation based on the identity (jl6p . These two reformulations can 
in turn be solved by a suitable method, namely Nesterov's smooth approximation scheme p5], for 
solving these min-max type problems, which we will not describe in this paper. In our computational 
experiments, we found that, among these two reformulations, the first one is computationally superior 
than the later one. Details of the computational comparison of these two approaches can be found 
in the technical report (see [13]), which this paper originated from. 

A more efficient method than the ones outlined in the previous paragraph for solving (j22p is 
based on solving the dual of ([32]) . namely the problem 



:= {W G 5^+^ :0^W ^ tI/m,Tr{W) = t}. 



(31) 




(32) 




(33) 



10 



whose objective function has the desirable property that it has Lipschitz continuous gradient (see 
Subsection 14.21 for specific details). In Subsections 14.11 and 14.21 we describe an algorithm, namely, a 
variant of Nesterov's smooth method, for solving (|33p which, as a by-product, yields a pair of primal 
and dual nearly-optimal solutions, and hence a nearly-optimal solution of (j32p . Finally, Section [5] 
only reports computational results for the approach outlined in this paragraph since it is far superior 
than the other two approaches outlined in the previous paragraph. 

3.3.2 Smooth saddle point reformulations for (12 ip 

In this subsection, we will provide a smooth saddle point reformulation for (j2ip that can be suitably 
solved by a variant of Nesterov's smooth method as described in Subsection 14. 1[ 

By directly applying Theorem 16. II to problem (|2ip . we obtain the following result. 

_ "I _ 

Lemma 3.6. Let m := min(j), g). Suppose that X E W^'^ satisfies ^ i7j(X) < M and let ^ be a 

i=l 

scalar such that 7 > 7, where 7 is given by 

^JM^. (34) 

i=l 

Then, the following statements hold: 

a) The optimal values of h21\) and the penalized problem 



1 

xeKp'x'j I 2 



min { ^\\kX - H\\l +7 



(35) 



coincide, and the optimal solution solution of h21\) is an optimal solution of h3^\l : 
b) if e >0 and X^ is an e-optimal solution of problem ^35\). then the point X^ defined as 



m 



- + 



i=l 



(36) 



x, + ex 

X ■=———, where e-. 

+ ^ M-j:a,{X) 

i=l 

is an e-optimal solution of (2l\) . 
We next provide a smooth saddle point reformulation for problem ()2ip . 

_ m _ 

Theorem 3.7. Let m := min(p, g). Suppose that X £ ?R.p^'^ satisfies ^ii-^) < ^'^'^ j be a 

i=l 

scalar such that 7 > 7, where 7 is defined in |3^[ ). For some e > 0, assume that X^ is an e-optimal 
solution of the problem 

min max \-\\AX-H\\l + -f{mg{X)»W-Mt)\, (37) 

xeBP/''{f^){t,w)(^n [2 J 
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where rx is defined in i23\) and 0, is defined as 



n ■= {{t,W) e^xS' 



(38) 



Let be defined in /i36\) . Then, X"" is an e- optimal solution of ^21\). 

Proof. Let denote the unique optimal solution of (j2ip . Then, is also an optimal solution 
of ([35|) in view of Lemma [3T6l f a) . and X^.^ satisfies Xl^ G B^''{rx) due to Lemma [STTl Also, relation 
(fT9ll with a = 1, P = —M and k = m implies that the objective functions of problems (f35]) and ([37|) 
are equal to each other over the whole space K^^"?. The above observations then imply that X^^ is 
also an optimal solution of (|37p and that problems (j35p and (|37p have the same optimal value. Since 
by assumption is an e-optimal solution of (|37p. it follows that is also an e-optimal solution of 
problem (j35p . The latter conclusion together with Lemma 13.6( b) immediately yields the conclusion 
of the theorem. ■ 

The saddle point (min-max) reformulation (j37p can be solved by a suitable method, namely, 
Nesterov's smooth approximation scheme |16j, which we will not describe in this paper. A more 
efficient method for solving (j2ip is based on solving the dual of (j37p . namely the problem 



whose objective function has the desirable property that it has Lipschitz continuous gradient (see 
Subsection 14.31 for specific details). In Subsections 14.11 and 14.31 we describe an algorithm, namely a 
variant of Nesterov's smooth method, for solving (j39p which, as a by-product, yields a pair of primal 
and dual nearly-optimal solutions, and hence a nearly-optimal solution of (|37p . 

4 Numerical methods 

In this section, we discuss numerical methods for solving problem ()22p . More specifically. Subsection 
14.11 reviews a variant of Nesterov's smooth method [20] , for solving a convex minimization problem 
over a relatively simple set with a smooth objective function that has Lipschitz continuous gradient. 
In Subsections 14. 21 and 14. 3^ we present the implementation details of the variant of Nesterov's smooth 
methd for solving the reformulations ([33]) of problem ([22|) and ([39]) of problem ([2T]) . respectively. 

The implementation details of the other formulations discussed in the paper, more specifically, 
the reformulations ()32p of problem ()22p and (I37p of problem (I2ip will not be presented here. The 
implementation details of some other reformulations of problems (|22p and ([2ip can be found in 
Subsection 4.2 of [l3]. 

4.1 Review of a variant of Nesterov's smooth method 

In this subsection, we review a variant of Nesterov's smooth first-order method |15] [T6] that is 
proposed by Tseng [20j for solving a class of smooth convex programming (CP) problems. 

Let U and V be normed vector spaces with the respective norms denoted by \\-\\u and || • ||v. We 
will discuss a variant of Nesterov's smooth first-order method for solving the class of CP problems 




max mm 




(39) 



min/(u) 



(40) 
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where the objective function f : U ^ ?R. has the form 

f{u):=max(p{u,v), \/u G U, (41) 

for some continuous function (j) : U x V ^ ^ and nonempty compact convex subsets U ^ lA and 
1/ C V. We make the following assumptions regarding the function (/>: 

B.l for every u £ U, the function (p{u, ■) : V ^ ?R- is strictly concave; 

B.2 for every v ^V, the function (p{-,v) : U ^ ?R. is convex differentiable; 

B.3 the function / is L-Lipschitz-differentiable on U with respect to || • Wu (see ([7])). 

It is well-known that Assumptions B.l and B.2 imply that the function / is convex differentiable, 
and that its gradient is given by 

Vf{u)=VuHu,v{u)), yu£U, (42) 

where v{u) denotes the unique solution of (|4ip (see for example Proposition B.25 of [4j). Moreover, 
problem (j40p and its dual, namely: 

m.ax{g{v) := min(j){u,v)}, (43) 

both have optimal solutions u* and v* such that f{u*) = g{v*). Finally, using Assumption B.3, 
Lu [12] recently showed that problem (j40p - ()4ip and its dual problem (j43p can be suitably solved by 
Nesterov's smooth method |16] . simultaneously. We shall notice, however, that Nesterov's smooth 
method [16] requires solving two prox-type subproblems per iteration. More recently, Tseng [20j 
proposed a variant of Nesterov's smooth method described as follows, which needs to solve one prox 
subproblem per iteration only. 

Let Pu : [/ — > be a differentiable strongly convex function with modulus au > with respect 
to II • Wu, i.e., 

Puiu) > Pu{u) + {Vpuiu),u - u) + —\\u - u\\lf, 'iu,ueU. (44) 
Let uq be defined as 

uq = argmin{pt/(ti) : u G [/}. (45) 

By subtracting the constant Pu(uq) from the function pui ), we may assume without any loss of 
generality that pu{uo) = 0. The Bregman distance dp^ : U x U — > associated with pjj is defined as 

dp^{u;u) = pu{u) - lp^{u;u), yu,ueU, (46) 

where Ip^ : U x U — > is the "linear approximation" of pu defined as 

Ipjj (n; n) = pu (tt) + {Vpu {u),u — u), \/{u,u) £ U x U. 

Similarly, we can define the function If {■',■) that will be used subsequently. 

We now describe the variant of Nesterov's smooth method proposed by Tseng [20] for solving 
problem ([im) - (jlT|) and its dual problem (03]). It uses a sequence {ak}k>o of scalars satisfying the 
following condition: 

0<ak<\^^aij ,yk>0. (47) 



13 



Clearly, ([171) implies that oq S (0, 1]. 
Variant of Nesterov's smooth algorithm: 

Let uq £ U and {afc}fc>o satisfy (|15|) and (HZl), respectively. 
Set Uq"^ = no, Wo = G V, To = 1 and /c = 1; 

1) Compute v{uk-i) and Vf{uk-i)- 

2) Compute (uf , u^^) e C/ x [/ and E F as 



u 



ag 
k 



argmin < — dp^{u; uq) + //(n; Uj) : n G U> (48) 

^ {l-Tk-i)uti + Tk-iul<^. 

3) Set Tk = ak/{Yli=o ™d Uk = {I - Tk)uf + TkU^^ . 

4) Set A; ^ A; + 1 and go to step 1). 
end 

We now state the main convergence result regarding the variant of Nesterov's smooth algorithm 
for solving problem (j40p and its dual (j43p . Its proof is given in Corollary 3 of Tseng [20j. 

Theorem 4.1. The sequence {{u'lf ,Vk)^ ^ U x V generated by the variant of Nesterov's smooth 
algorithm satisfies 



0<f{uf)-9{vk)< ^^Zi Vfe>l, (49) 
^u{L.i=Q ai) 

where 



Du = max{p{/(u) : u G [/}. (50) 

A typical sequence {afe} satisfying (flTI) is the one in which ak = {k + l)/2 for all /c > 0. With 
this choice for {a^}, we have the following specialization of Theorem 14.11 

Corollary 4.2. If ak = {k + l)/2 for every k >0, then the sequence {{ulfjVk)} U x V generated 
by the variant of Nesterov's smooth algorithm satisfies 

where Djj is defined in l\5(j\) . Thus, the iteration- complexity of finding an e-optimal solution to ^JW 
and its dual |^3[ ) by the variant of Nesterov's smooth algorithm does not exceed 2[{LDi/) / {ajjf-)]^''^ ■ 

Before ending this subsection, we state sufficient conditions for the function (p to satisfy Assump- 
tions B.1-B.3. The proof of the following result can be found in Theorem 1 of |16j . 



Proposition 4.3. Let a norm \\ ■ ||v on V be given. Assume that cp : U x V ^ ^ has the form 

(piu, v) = 9{u) + (n, £v) - h{v), V(u, v)£U xV, (51) 
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where £ : V ^ U* is a linear map, 9 : U ^ ?R. is Lo-Lipschitz-differentiable in U with respect to \\ ■ Wu, 
and h : V ^ ^ is a differentiable strongly convex function with modulus ay > with respect to \\ ■ ||v. 
Then, the function f defined by ( [^ip is {Lq + y/ ay) -Lipschitz- differentiable in U with respect to 
II • 11^^. As a consequence, (p satisfies Assumptions B.1-B.3 with norm \\ ■ \\u and L = Lg + \\£\\i^ yjcry ■ 

We will see in Section H] that all saddle-point reformulations ([I0l) - ()1T]) of problems (pT]) and (p2]) 
studied in this paper have the property that the corresponding function can be expressed as in 
(EH). 



4.2 Implementation details of the variant of Nesterov's smooth method for ( 1331) 

The implementation details of the variant of Nesterov's smooth method (see Subsection 14. ip for 
solving formulation ([33]) (that is, the dual of ([32]) ) are addressed in this subsection. In particular, 
we describe in the context of this formulation the prox-function, the Lipschitz constant L and the 
subproblem ([18]) used by the variant of Nesterov's smooth algorithm of Subsection 14.11 
For the purpose of our implementation, we reformulate problem ([33p into the problem 



mm max 



|-Amr,C?(X) . W - ^||r,AX - i/||^| (52) 



obtained by scaling the variable X of ([33p as X ^ X/r^^ and multiplying the resulting formulation 
by —1. From now on, we will focus on formulation (|52p rather than (|33p . 
Let n := p + q, u := W, v := X and define 

U ■.= niQ 5" =: U, 

V := C =: V, 

and 

(p{u,v) := -Xmr^g^v) •u- -\\rr,Av - H\\p, y{u,v) eU xV, (53) 



where 0,i is defined in pip. Also, assume that the norm on U is chosen as 

\\u\\u := II^^IIf) G U. 

Our aim now is to show that </> satisfies Assumptions B.1-B.3 with || • Wu as above and some Lipschitz 
constant L > 0, and hence that the variant of Nesterov's method can be applied to the corresponding 
saddle-point formulation ([52p . This will be done with the help of Proposition HT3l Indeed, the function 
(j) is of the form (|5ip with ^ = and the functions £ and h given by 

£v := —XmrxGiv), \/v £ V, 

h{v) := ^\\rxAv-H\\j,, G V. 

Assume that we fix the norm on V to be the Frobenius norm, i.e., || • ||v = || • ||ir. Then, it is easy to 
verify that the above function h is strongly convex with modulus ay '■= r^/||A~^|p with respect to 
II ■ llv = II ■ ||f- Now, using ([6]), we obtain 

\\£\\u,v = max{||Amr^t/(i;)||^ : v € V , \\v\\\; < 1} , 
= Amr^; max{||^(u)||i? : G V, ||u||_f < 1} , 

= Amra; max |\/2||t>||ir : f G V, ||f ||f < l| = V^Amrx. (54) 
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Hence, by Proposition 14.31 we conclude that (j) satisfies Assumptions B.1-B.3 with || • Wu = \\ ■ \\f and 

L = \\£\\ly/av = 2A2m2||A-i||2. 

The prox-function pu{-) for the set U used in the variant of Nesterov's algorithm is defined as 

Pu{u) = Tr(nlogn) +logn, \/u £ U = ili. (55) 

We can easily see that pu{') is a strongly differentiable convex function on U with modulus ajj = m 
with respect to the norm || • \\k = \\ ■ \\f. Also, it is easy to verify that mm{pij{u) : u £ U} = and 
that 

uq := argminp[/(u) = //n, (56) 
Dfj := maxp[/(n) = log(n/m). 

As a consequence of the above discussion and Theorem 14. 2 ( we obtain the following result. 

Theorem 4.4. For a given e > 0, the variant of Nesterov's smooth method applied to i l^^p finds an 
e-optimal solution of problem i33\) and its dual, and hence of problem \2S\) . in a number of iterations 
which does not exceed 

^2\/2A||A-i|| ^- — ^-^^1 , , 
j= yra\og{n/m) . (57) 

We observe that the iteration-complexity given in (157p is in terms of the transformed data of 
problem We next relate it to the original data of problem 



Corollary 4.5. For a given e > 0, the variant of Nesterov's smooth method applied to h33\) finds an 
e-optimal solution of problem l^3S\) and its dual, and hence of problem Ii2^) . in a number of iterations 
which does not exceed 

^2^/2A||(A^A)-i/2| 



■\/m log(n/m) 

Proof. We know from Subsection 13.11 that A^A = QA^Q^, where Q G is an orthonormal 

matrix. Using this relation, we have 

||A-1|| = ||A-2||l/2 = ||(A^A)-lf/2 = ||(A^A)-1/2||. 

The conclusion immediately follows from this identity and Theorem 14.41 ■ 

It is interesting to note that the iteration-complexity of Corollary 14.51 depends on the data matrix 
A but not on B. Based on the discussion below, the arithmetic operation cost per iteration of the 
variant of Nesterov's smooth method when applied to problem ()39p is bounded by 0{mpq) where 
m = min(p, q), due to the fact that its most expensive operation consists of finding a partial singular 
value decomposition oi a p x q matrix h as in ()60p . Thus, the overall arithmetic-complexity of the 
variant of Nesterov's smooth method when applied to (I33p is 

O ( ^^^^"^^^^''^^^^ m^/^pq^login/m)] . 
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After having completely specified all the ingredients required by the variant of Nesterov's smooth 
method for solving (j52p . we now discuss some of the computational technicahties involved in the 
actual implementation of the method. 

First, recall that, for a given u £ U, the optimal solution for the maximization subproblem 
(j41|) needs to be found in order to compute the gradient of V/(n). Using (j53|) and the fact that 

V = B^'^{1), we see that the maximization problem (j4ip is equivalent to 

min —\\rx Av - Hfp + G»v, (58) 

where G := G*{u) G W^"^. We now briefly discuss how to solve (j58|) . For any ^ > 0, let 
viO = {rlK' + iir\r,AH - G), ^{0 = WviOfp " 1- 

If ^(0) < 0, then clearly v{0) is the optimal solution of problem ()58p . Otherwise, the optimal solution 
of problem ([58]) is equal to v{£,*), where ^* is the root of the equation ^(^) = 0. The latter can be 
found by well-known root finding schemes specially taylored for solving the above equation. 

In addition, each iteration of the variant of Nesterov's smooth method requires solving subproblem 
()48l) . In view of (I42p and (I53p . it is easy to see that for every u G U, we have V/(n) = G{v) for some 

V G W^'^. Also, Vpu{uo) = (1 — logn)/ due to ([55]) and ([56]) . These remarks together with ([^6]) and 
([55p imply that subproblem (I48p is of the form 

min {ql + g{h))»u + Tr{ulogu) (59) 

for some real scalar and h G 3^^^'', where ili is given by ([3ip . 

We now present an efficient approach for solving ([59p which, instead of finding the eigenvalue 
factorization of the (p + g)-square matrix ql + ^(/i), computes the singular value decomposition of 
the smaller p x g-matrix h. First, we compute a singular value decomposition of h, i.e., h = UT,V'^, 
where [/ G SRf^™, ^ G 3f?«^™ and S are such that 

U^U = I, S = Diag(ai(/i),...,a™(/i)), V^V = I, (60) 

where (Ti{h), . . . ,amih) are the m = min(p, g) singular values of h. Let and r/j denote the ith 
column of U and V, respectively. Using ([5]), it is easy to see that 

r = -L(|).,= l.....„; /- = -L(_''. )., = 1.....™. (61) 

are orthonormal eigenvectors of Gih) with eigenvalues ai{h), . . . , am{h), —ai{h), . . . , —am{h), respec- 
tively. Now let /* G 3ft" for i = 2m -|- 1, . . . , n be such that the matrix F := {f^ , f"^, . . . , /") satisfies 
F^F = /. It is well-known that the vectors /* G K", i = 2m + 1, . . . ,n, are eigenvectors of G{h) 
corresponding to the zero eigenvalue (e.g., see [3]). Thus, we obtain the following eigenvalue decom- 
position of + G{h): 

?I + g{h) = FDiag(a)F^, a = c^e + {ai{h), am{h), -ai{h), . . . , -c7^(/i), 0, . . . , 0)^. 
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Using this relation and (j3ip with t = 1, it is easy to see that the optimal solution of (j59p is v* = 
FDiag{w*)F^ , where w* G is the unique optimal solution of the problem 

min a^w + log w 

s.t. e'^w = 1, (62) 
< w < e/m. 

It can be easily shown that w* = min{exp(— Oj — 1 — ,^*), 1/m}, where is the unique root of the 
equation 

n 

min{exp(— — 1 — ^), 1/m} — 1 = 0. 

i=l 

Let := min{exp(— <j — 1 — ^*), 1/m}. In view of the above formulas for a and w* , we immediately 
see that 

W2m+1 = W2m+2 = ■■■ =wl = ^. (63) 

Further, using the fact that FF^ = /, we have 

n 2m 
i=2m+l i=l 

Using this result and (I63p . we see that the optimal solution v* of (j59p can be efficiently computed as 

n 2m 

= FI)iag{w*)F^ = ^ «;*r (f )'^ = + - ^)r (f)^, 

2=1 i=l 

where the scalar ^ is defined above and the vectors {/* : i = 1, . . . 2m} are given by (|6ip . 

Finally, to terminate the variant of Nesterov's smooth method, we need to evaluate the primal 
and dual objective functions of problem ()52p . As mentioned above, the primal objective function 
f{u) of (j52p can be computed by solving a problem of the form (j58p . Additionally, in view of (|17p 
and ([3T]) . the dual objective function g{v) of ([52]) can be computed as 

^ m 

1=1 

4.3 Implementation details of the variant of Nesterov's smooth method for (1391) 

The implementation details of the variant of Nesterov's smooth method (see Subsection 14. ip for 
solving formulation ([391) (that is, the dual of (f37|) ) are addressed in this subsection. In particular, 
we describe in the context of this formulation the prox- function, the Lipschitz constant L and the 
subproblem ()48p used by the variant of Nesterov's smooth algorithm of Subsection 14.11 
For the purpose of our implementation, we reformulate problem (|39p into the problem 

min max < —j[mfx g{X)»W-Mt]--\\f.xAX-Hfp} (64) 
{t,w)&nxeBP/''(i) [ 2 J 
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obtained by scaling the variables X of (j39p as X ^ X/f^, and multiplying the resulting formulation 
by —1. From now on, our discussion in this subsection will focus on formulation ()64p rather than 

Let n := p + q, u := {t, W), v := X and define 

[/ := C SR X cS" =: Z^, 
V := C =: V 

and 

4){u,v) := -'yimra^Giv) - Mt]- -\\r:rAv - H\\1^, y{u,v)eUxV, (65) 
where is defined in ()38p . Also, assume that the norm on W is chosen as 

\\u\\u ■■= iCt^ + \\W\\l)^/^ Vn = (t, W) £ U, 

where ^ is a positive scalar that will be specified later. Our aim now is to show that cj) satisfies 
Assumptions B.1-B.3 with || • Wu as above and some Lipschitz constant L > 0, and hence that the 
variant of Nesterov's method can be applied to the corresponding saddle-point formulation (j64p . 
This will be done with the help of Proposition I4.3[ Indeed, the function cp is of the form ()5ip with 
9, £ and h given by 

e{u) := jMt, yu = {t,W) eU, 

£v := {0, —^mfxG{i^)), Vv G V, (66) 
h{v) := ^Wr^r^v - Hfp, Vf G V. 

Clearly, is a linear function, and thus it is a 0-Lipschitz-differentiable function on U with respect to 
II • Now, assume that we fix the norm on V to be the Frobenius norm, i.e., || • ||v = || • ||f- Then, 
it is easy to verify that the above function h is strongly convex with modulus cry := f^./||A~"'^ |p with 
respect to || • ||v = || • ||f- Now, using ([6]), ([66]) and the fact that 

llnli;:^ = icH^ + ||W^|||)^/^ Vu = (t, W)£U*= U, (67) 

we obtain 

\\£\\u,v = max{||(0, -7mf^.^(v))||^ : G V, ||?7||v < 1} , 
= 7mfsmax{||^(f)||F : G V, ||f ||f < 1} , 

= 7mfa; max |\/2||w||p' : v £ V,\\v\\p < 1^ = \'^^m'rx- (68) 
Hence, by Proposition 14.3] we conclude that 4> satisfies Assumptions B.1-B.3 with || • Wu = \\ ■ \\f and 

L = Lg + \\£\\u,v/<Tv = 2-f^m'^\\A-^f. (69) 

We will now specify the prox- function pu for the set U used in the variant of Nesterov's algorithm. 
We let 

pu{u) =TT:{WlogW) + atlogt + bt + c, \/u= {t,W) eU, (70) 
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where 

a := log — , b := logn — a — 1 = logm — 1, c := a + 1. (71) 

m 



For a fixed t £ [0, 1] , it is easy to see that 



min pu(t, W) = ihit) := t log — + otlogt + bt + c, 



and that the minimum is achieved at W = tl/n. Now, 



^'(1) = log i + 1 + a(log l + l)+6 = l-logn + a + 6 = 0, 
n 



where the last equality follows from the second identity in ()7ip . These observations together with 
(j38p allow us to conclude that 

argmin„g[/p[/(n) = uq := (l,I/n), (72) 
mmueuPuiu) = V'(l) = - logn + 6 + c = 0, (73) 

where the last equality is due to second and third identities in (j7ip . Moreover, it is easy to see that 

t Tl 

Du := TiiSiKpijiu) = max t log h at logt + ht + c = c + max{0, h — logm} = 1 + log — , (74) 

m6(7 te[o,i] m m 

where the last identity is due to ([7T]) . Also, we easily see that pu{-) is a strongly differentiable convex 
function on U with modulus 

au = min(a/,^, m) (75) 

with respect to the norm || • H^^. 

In view of (169p . (174p . (175p and Corollary [421 it follows that the iteration-complexity of the variant 
of Nesterov's smooth method for finding an e-optimal solution of ()64|) and its dual is bounded by 



27m||A"i|| /2[1 + log(n/m) 



^/e y min(a/^, m) 

As a consequence of the above discussion and Corollary 14.21 obtain the following result. 

Theorem 4.6. For a given e > 0, the variant of Nesterov's smooth method, with prox- function 
defined by ^70\)-f71^, L given by l^69\) and au given by |75D with ^ = a/m , applied to (6^, finds an 
e-optimal solution of problem |g^[ ) and its dual in a number of iterations which does not exceed 



2^/27||A-i| 



a/1 + log(n/m) 



(76) 



Proof. We have seen in the discussion preceding this theorem that the iteration-complexity of 
the variant of Nesterov's smooth method for finding an e-optimal solution of ()64p and its dual is 
bounded by r(^) for any ^ > 0. Taking ^ = a/m, we obtain the iteration-complexity bound ()76p . ■ 



We observe that the iteration-complexity given in (j76p is in terms of the transformed data of 
problem We next relate it to the original data of problem The proof of the following 
corollary is similar to that of Corollary 14.51 
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Corollary 4.7. For a given e > 0, the variant of Nesterov's smooth method, with prox-function 
defined by (7U\)-(71\), L given by JgPj) and au given by ( [75| j with ^ = a/m , applied to applied to 
finds an e-optimal solution of problem Ii3y\) and its dual in a number of iterations which does not 
exceed 

^2^/27||(^^^)-i/2|K/m 



(77) 



Observe that, in view of Lemma 13.61 with X = and Theorem 13.71 (j77p is also an iteration- 
complexity bound for finding an e-optimal solution of problem ()2ip whenever 
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M M 

where the later equality is due to ([20]) . 

Based on the discussion below and in Subsection 14.21 the arithmetic operation cost per iteration 
of the variant of Nesterov's smooth method when applied to problem (j39p is bounded by 0{mpq) 
where m = min(p, g), due to the fact that its most expensive operation consists of finding a partial 
singular value decomposition of a p x q matrix h as in ()60p . Thus, the overall arithmetic-complexity 
of the variant of Nesterov's smooth method when applied to (j39p is 

O 



'^||(^r^N-i/2|| 

— -m^''^pqy/log{n/m) 



After having completely specified all the ingredients required by the variant of Nesterov's smooth 
method for solving (j64p . we now discuss some of the computational technicalities involved in the 
actual implementation of the method. 

First, for a given u £ U, the optimal solution for the maximization subproblem (j4ip needs to be 
found in order to compute the gradient of V/(u). The details here are similar to the corresponding 
ones described in Subsection 14.21 (see the paragraph containing relation (|58p ). 

In addition, each iteration of the variant of Nesterov's smooth method requires solving subproblem 
([ISp . In view of ([12]) and ([65]) . it is easy to observe that for every u = {t,W) G U, we have 
V/(tt) = ir],g{v)) for some rj G ^ and v G SR^^'?. Also, by dZD]), dZI]) and ([72]), we easily see that 
^Pu{uo) = (logn — 1)(1,— I). Using these results along with (PSP and ([55]) . we easily see that 
subproblem (j48p is equivalent to one of the form 

min {{<il + g{h))»W + at + Ti(WlogW)+atlogt} (78) 

{t,w)en 

for some a,? E and h E SR^^'?, where a and Cl are given by ([7T]) and ([38]) . respectively. 

We now discuss how the above problem can be efficiently solved. First, note that by (j38p . we 
have (t, W) E Q if, and only if, W = tW for some W E ^i- This observation together with the fact 
that TrM/^' = 1 for every W' E ili allows us to conclude that problem ()78p is equivalent to 

min \tkI + g(h))»W' + at + t\TT(W'logW') + aogt)Tr(W')] +atlogt} (79) 

iy'er2i,t6[o,i] 

= min at + {a + l)tlogt + td, (80) 

iG[0,l] 
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where 



d:= min {<^I + g{h)) • W + Tt:{W' logW'). 



(81) 



Moreover, if W' is the optimal solution of (jSip and t is the optimal solution of (|80p. then = tW' 
is the optimal solution of (|79|) . Problem (|81|) is of the form (|59p where an efficient scheme for solving 
it is described in Subsection 14. 2[ It is easy to see that the optimal solution of ([80^ is given by 



Finally, to terminate the variant of Nesterov's smooth approximation scheme, we need to properly 
evaluate the primal and dual objective functions of problem (I64p at any given point. As seen from 
(j^T|) and ([65]) . the primal objective function /(n) of can be computed by solving a problem in 
the form of (f58l) . Additionally, in view of (fT9]) and ([38l) . the dual objective function g{v) of (f64|) can 
be computed as 



5 Computational results 

In this section, we report the results of our computational experiment which compares the perfor- 
mance of the variant of of Nesterov's smooth method discussed in Subsection 14.21 for solving problem 
(j22p with the interior point method implemented in SDPT3 version 4.0 (beta) |19j on a set of ran- 
domly generated instances. 

The random instances of (j22p used in our experiments were generated as follows. We first ran- 
domly generated matrices j4 S and B £ K'^^, where p = 2q and / = Wq, with entries uniformly 
distributed in [0, 1] for different values of q. We then computed H and A for ([22|) according to the 
procedures described in Subsection 13.11 and set the parameter A in (j22p to one. In addition, all 
computations were performed on an Intel Xeon 5320 CPU (1.86GHz) and 12GB RAM running Red 
Hat Enterprise Linux 4 (kernel 2.6.9). 

In this experiment, we compared the performance of the variant of Nesterov's smooth method 
(labeled as VNS) discussed in Subsection 14.21 for solving problem (j22p with the interior point method 
implemented in SDPT3 version 4.0 (beta) [19j for solving the cone programming reformulation (I28p . 
The code for VNS is written in C, and the initial point for this method is set to be uq = I/{p + q)- It 
is worth mentioning that the code SDPT3 uses MATLAB as interface to call several C subroutines 
to handle all its heavy computational tasks. SDPT3 can be suitably applied to solve a standard cone 
programming with the underlying cone represented as a Cartesian product of nonnegative orthant, 
second-order cones, and positive semidefinite cones. The method VNS terminates once the duality 
gap is less than e = 10~^, and SDPT3 terminates once the relative accuracy is less than 10"*^. 

The performance of VNS and SDPT3 for our randomly generated instances are presented in Table 
[TJ The problem size (p, q) is given in column one. The numbers of iterations of VNS and SDPT3 are 
given in columns two and three, and the objective function values are given in columns four and five, 
CPU times (in seconds) are given in columns six to seven, and the amount of memory (in mega bytes) 
used by VNS and SDPT3 are given in the last two columns, respectively. The symbol "N/A" means 
"not available". The computational result of SDPT3 for the instance with {p,q) = (120,60) is not 



t = min 1 , exp I — 1 — 
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Table 1: Comparison of VNS and SDPT3 



Problem 




Iter 


Obj 


Time 


Memory 


(p, q) 


VNS 


SDPT3 


VNS 


SDPT3 


VNS 


SDPT3 


VNS 


SDPT3 


(20, 10) 


36145 


17 


4.066570508 


4.066570512 


16.6 


5.9 


2.67 


279 


(40, 20) 


41786 


15 


8.359912031 


8.359912046 


55.7 


77.9 


2.93 


483 


(60, 30) 


35368 


15 


13.412029944 


13.412029989 


96.7 


507.7 


3.23 


1338 


(80, 40) 


36211 


15 


17.596671337 


17.596671829 


182.9 


2209.8 


3.63 


4456 


(100, 50) 


33602 


19 


22.368563640 


22.368563657 


272.6 


8916.1 


4.23 


10445 


(120, 60) 


33114 


N/A 


26.823206950 


N/A 


406.6 


N/A 


4.98 


> 16109 



available since it ran out of the memory in our machine (about 15.73 giga bytes). We conclude from 
this experiment that the method VNS, namely, the variant of Nesterov's smooth method, generally 
outperforms SDPT3 substantially even for relatively small-scale problems. Moreover, VNS requires 
much less memory than SDPT3. For example, for the instance with {p, q) = (100, 50), SDPT3 needs 
10445 mega (~ 10.2 giga) bytes of memory, but VNS only requires about 4.23 mega bytes of memory; 
for the instance with {p,q) = (120,60), SDPT3 needs at least 16109 mega (« 15.73 giga) bytes of 
memory, but VNS only requires about 4.98 mega bytes of memory. 

6 Concluding remarks 

In this paper, we studied convex optimization methods for computing the trace norm regularized least 
squares estimate in multivariate linear regression. In particular, we explore a variant of Nesterov's 
smooth method proposed by Tseng [20] and interior point methods for computing the penalized 
least squares estimate. The performance of these methods is then compared using a set of randomly 
generated instances. We showed that the variant of Nesterov's smooth method generally substantially 
outperforms the interior point method implemented in SDPT3 version 4.0 (beta) [19]. Moreover, the 
former method is much more memory efficient. 

In Subsection 13 . 1 1 we provided an approach for simplifying problem (j4]) which changes the variable 
U, in addition to the data A and B. A drawback of this approach is that it can not handle extra 
constraints (not considered in this paper) on U. It turns out that there exists an alternative scheme 
for simplifying problem (jj]), i.e. one that eliminates the dependence of the data on the (generally, 
large) dimension /, which does not change U. Indeed, by performing either a QR factorization of A 
or a Cholesky factorization of A^A, compute an upper triangular matrix R such that R^R = A^A. 
Letting G := R~^A^B, it is straightforward to show that problem ^ can be reduced to 

niin|||G-i?C/|||, : ^(Ji(C/) < m| . (82) 

Clearly, in contrast to reformulation (1211) . the above one does not change the variable U and hence 
extra constraints on U can be easily handled. On the other hand, a discussion similar to that in 
Subsection 14.21 shows that each iteration of the variant of Nesterov's smooth method applied to (I82p . 
or its Lagrangian relaxation version, needs to solve subproblem (I58p with A replaced by R. Since R 
is an upper triangular matrix and A is a diagonal matrix, the later subproblems are much harder to 
solve than subproblems of the form ()58p . For this reason, we have opted to use reformulation ()2ip 
rather than (j82p in this paper. 



23 



Appendix 

In this section, we discuss some technical results that are used in our presentation. More specifi- 
cally, we discuss two ways of solving a constrained nonlinear programming problem based on some 
unconstrained nonlinear programming reformulations. 

Given a set % ^ X (1 5R" and functions / : X — > 3f? and h : X ^ , consider the nonlinear 
programming problem: 

/* = inf {/(x) : X G X, < 0, i = 1, . . . , A;}. (83) 

The first reformulation of (j83p is based on the exact penalty approach, which consists of solving the 
exact penalization problem 

f^* = inf {/^(x) := f{x)+^[g{x)]+ : x G X}, (84) 

for some large penalty parameter 7 > 0, where g{x) = max{/ij(x) : i = 1, . . . , k}. To obtain stronger 
consequences, we make the following assumptions about problem (j83|) : 

A.l) The set X is convex and functions / and hi are convex for each i = 1, . . . , /c; 

A. 2) /* G and there exists a point ^ X such that g{x^) < 0. 

We will use the following notion throughout the paper. 

Definition 1. Consider the problem of minimizing a real- valued function /(x) over a certain 
nonempty feasible region contained in the domain of / and let / := inf{/(x) : x G J-}. For 
e > 0, we say that x^ is an e-optimal solution of this problem if x^ G and /(x^) < e + f. 

We note that the existence of an e-optimal solution for some e > implies that / is finite. 

Theorem 6.1. Suppose Assumptions A.l and A. 2 hold and define 

_ /(x°) -r 

^' b(xO)| - 

For X £ X, define 

, , x-^6'(x)x° , ^, , [g(x)l+ , , 

Then, the following statements hold: 

a) for every x G X, the point z{x) is a feasible solution of i83\): 

b) A* = /* for every 7 > 7; 

c) for every 7 > 7 and e > 0, any e-optimal solution of i83\) is also an e-optimal solution of ll[84\ ); 

d) 2/7 > 7, e > and xl is an e-optimal solution of (8^, then the point z{xl) is an e-optimal 
solution of 



^) if 1 > 7; e ^ and x] is an e-optimal solution of ^8^, then f{xj) — f* < e and [g{x1)\^ < 
e/(7-7). 
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Proof. Let x G X be arbitrarily given. Clearly, convexity of X, the assumption that ^ X and 
the definition of z(x) imply that z{x) G X. Moreover, Assumption A.l implies that g : X ^ ?R. is 
convex. This fact, the assumption that g{x^) < 0, and the definitions of z{x) and 9{x) then imply 
that 

g{x)+e{x)g{x^) [g{x)]+-e{xMx^)\ _ 

g{z[x))<, i + ^ i + -u- 

Hence, statement (a) follows. 

To prove statement (b), assume that 7 > 7 and let x G X be given. Convexity of / yields 
(1 + 6{x))f{z{x)) < f{x) + 6{x)f{x^), which, together with the definitions of 7 and 6{x), imply that 

f^{x)-r = f{x)+i[g{xT-r 

> (1 + e{x))f{z{x)) - 0(x)/(xO) + 7[5(x)]+ - r 

= (1 + e{x)){f{z{x)) - n - e{x){f{x^) - n + 7bW]+ 

= (l + 0(x))(/(z(x))-r) + (7-7)[5(2;)]+. (86) 

In view of the assumption that 7 > 7 and statement (a), the above inequality implies that f-y{x)—f* > 
for every x G X, and hence that f-y* > f*. Since the inequality f^* < f* obviously holds for any 
7 > 0, we then conclude that f^* = f* for any 7 > 7. Statement (c) follows as an immediate 
consequence of (b). 

For some 7 > 7 and e > 0, assume now that xj is an e-optimal solution of (j84p . Then, statement 
(b) and inequality ([86ll imply that 

6 > f,ix2) - f,* > (1 + e{x2))ifizix2)) - n + (7 - 7)b(xJ)]+. (87) 

Using the assumption that 7 > 7, the above inequality clearly implies that f{z{xj)) — /* < e/(l + 
^{xl)) < e, and hence that z{x'l) is an e-optimal solution of (j83p in view of statement (a). Hence, 
statement (d) follows. Moreover, if 7 > 7, we also conclude from ((871) that [g{xl)\^ < e/(7 — 7). 
Also, the first inequality of ^ implies that f{x2) - f* < f{x7) + -f[g{x7)]+ - f* = f-y{x2) - f^* < e, 
showing that statement (e) holds. ■ 

We observe that the threshold value 7 depends on the optimal value /*, and hence can be 
computed only for those problems in which /* is known. If instead a lower bound fi < /* is known, 
then choosing the penalty parameter 7 in problem ([8^ as 7 := {f{x^) — fi)/\g{x^)\ guarantees that 
an e-optimal solution x2 of (j84p yields the e-optimal solution z{x2) of (|83p. in view of Theorem 16.1 I f c). 

The following result, which is a slight variation of a result due to H. Everett (see for example 
pages 147 and 163 of [8]), shows that approximate optimal solutions of Lagrangian subproblems 
associated with ()83p yield approximate optimal solutions of a perturbed version of (j83p . 

Theorem 6.2. (Approximate Everett's theorem) Suppose that for some x^ is an 

e-optimal solution of the problem 



fl = inf \ fix) + J2 ^Mx) : X G X I . 



I 1=1 



Then, x^ is an e-optimal solution of the problem 

f:^ = inf [fix) : X G X, hiix) < /i,(x^), i = 1, . . . , A;} . (89) 
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Proof. Let x be a feasible solution of ([89]) . Since is an e-optimal solution of ([88]) . we have 
f{x^) + Yli=i ^ /a + f • This inequality together with the definition of f^^ in ([88j) implies 

that 

fc k 

f{x^) < rx-Y,XMx^) + e < f{x)+Y,Xi[hi{x)-hi{x^)]+e < /(x)+e, 

i=l i=l 

where the last inequality is due to the fact that Aj > for all i = 1, . . . ,k and x is feasible solution 
of (j89p . Since the latter inequality holds for every feasible solution x of (j89p . we conclude that 
f{x^) < f*^ + e, and hence that x^ is an e-optimal solution of ■ 

If our goal is to solve problem inf{/(x) : x £ X, hi{x) < bi, i = 1, . . . ,k} for many different right 
hand sides 6 G JR'', then, in view of the above result, this goal can be accomplished by minimizing 
the Lagrangian subproblem (|88p for many different Lagrange multipliers A G We note that this 
idea is specially popular in statistics for the case when k = 1. 
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